2. Azure Tables Versus Traditional Databases
Usually, the people who have the least trouble moving from their
existing storage systems to Azure’s Table service are the ones who
accept that it is not a typical database system, and don’t expect to
find familiar SQL features.
Let’s take a quick look at how Azure tables compare with
traditional database tables.
2.1. Denormalized data
In a traditional database system, DBAs go to great lengths to
ensure that data isn’t repeated and that it stays consistent, through
a process called normalization. Depending on how much time you have
and how obsessive you are, you can normalize your data to any of
several normal forms. Normalization serves a good purpose: it ensures
data integrity. On the other hand, it hurts performance.
Data stored in one Azure table has no relationship with data
stored in another Azure table. You cannot specify foreign key
constraints to ensure that data in one table stays consistent with
data in another. Instead, you must ensure that your schema is
sufficiently denormalized.
Anyone running a high-performance database system has probably
denormalized parts of that system’s schema, or at least experimented
with doing so. Denormalization means storing multiple copies of data
and achieving data consistency by performing multiple writes. Though
this buys you better performance and flexibility, the onus is now on
you, the developer, to maintain data integrity. If you forget to
update one table, your data becomes inconsistent, and
difficult-to-diagnose errors can show up at the application level.
Also, the extra writes take time and can hurt write performance.
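To make the multiple-writes idea concrete, here is a minimal sketch
using the newer azure-data-tables Python SDK (which postdates this
discussion); the table, key, and property names are made up for
illustration. The same message is written twice so that it can be
queried cheaply by two different keys, and both writes are your
responsibility:

from azure.data.tables import TableServiceClient

# "UseDevelopmentStorage=true" points at the local storage emulator (Azurite).
service = TableServiceClient.from_connection_string("UseDevelopmentStorage=true")
by_user = service.create_table_if_not_exists("MessagesByUser")
by_forum = service.create_table_if_not_exists("MessagesByForum")

message = {"Author": "alice", "Forum": "azure", "Body": "Hello, tables!"}

# Copy 1: partitioned by author, so "all messages by alice" is a single
# partition scan.
by_user.create_entity({"PartitionKey": message["Author"], "RowKey": "msg-001", **message})

# Copy 2: partitioned by forum, so "all messages in azure" is also cheap.
# Forgetting this second write is exactly how the copies drift apart.
by_forum.create_entity({"PartitionKey": message["Forum"], "RowKey": "msg-001", **message})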
Note: In forums, in books, and in the blogosphere, you may see some
“experts” recommending that people denormalize to improve
performance, usually without taking the trouble to explain why it
helps. The reason is simple: data in different tables is typically
stored in different files on disk, or sometimes on different
machines. Normalization implies database joins, which must read
multiple tables from multiple places into memory, and that hurts
performance. One of the reasons Azure tables provide good
performance is that the data is denormalized by default.
If you’re willing to live with very short periods of
inconsistency, you can perform the extra writes asynchronously or
hand them to a worker process, as the sketch below shows. Some
inconsistency is inevitable at this scale: all major Web 2.0 sites
(such as Flickr) frequently run tools that check for data consistency
issues and fix them.
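Here is one way that hand-off might look, sketched with the
azure-storage-queue Python SDK; the queue name and message shape are
hypothetical. The front end enqueues a description of the deferred
write, and a worker applies it later:

import json
from azure.storage.queue import QueueClient

# Hypothetical queue of "please write this copy too" work items
# (assumes the queue has already been created).
queue = QueueClient.from_connection_string(
    "UseDevelopmentStorage=true", queue_name="pending-writes")

# Front end: do the primary write synchronously (not shown), then enqueue
# the secondary write instead of performing it inline.
queue.send_message(json.dumps({
    "table": "MessagesByForum",
    "entity": {"PartitionKey": "azure", "RowKey": "msg-001", "Author": "alice"},
}))

# Worker process: drain the queue and apply the deferred writes.
for msg in queue.receive_messages():
    work = json.loads(msg.content)
    # ... perform the table write described by `work` ...
    queue.delete_message(msg)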
2.2. No schema
Several advantages come with a fixed schema for all your data. A
schema can act as a safety net, trading flexibility for safety: any
errors in your code where you mismatch data types can be caught
early. However, this same safety net gets in your way when you have
semistructured data, because changing a table’s structure to add or
change columns is difficult, and sometimes impossible (ALTER TABLE is
the stuff of nightmares).
Azure tables have no schema. Entities in the same table can have
completely different properties, or different numbers of properties,
as the example below shows. As with denormalization, the onus is on
the developer to ensure that updates reflect the correct schema.
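A quick sketch with the newer azure-data-tables Python SDK (the
entity contents are made up): two entities with entirely different
properties live happily in one table, and nothing in the service
checks them against a schema.

from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("UseDevelopmentStorage=true")
inventory = service.create_table_if_not_exists("Inventory")

# A book and a camera in the same table, each with its own set of properties.
inventory.create_entity({
    "PartitionKey": "books", "RowKey": "b-001",
    "Title": "Some Book", "Pages": 368,
})
inventory.create_entity({
    "PartitionKey": "cameras", "RowKey": "c-001",
    "Megapixels": 12.1, "HasFlash": True,
})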
2.3. No distributed transactions
If you are accustomed to using transactions to maintain consistency and integrity, the
idea of not having any transactions can be scary. However, in any
distributed storage system, transactions across machines hurt
performance. As with normalization, the onus is on the developer to
maintain consistency and run scripts to ensure that.
This is not as scary or as difficult as it sounds. Large
services such as Facebook and Flickr have long eschewed transactions
as they scaled out. It’s a fundamental trade-off that you make with
cloud-based storage systems.
Though distributed transactions aren’t available, Windows Azure
tables do support “entity group transactions,” which let you batch
requests for entities in the same partition.
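In the newer azure-data-tables Python SDK, an entity group
transaction is a submit_transaction call (a sketch; the table and
property names are illustrative). Every operation in the batch must
target the same PartitionKey, and the batch commits or fails as a
unit:

from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("UseDevelopmentStorage=true")
orders = service.create_table_if_not_exists("Orders")

# Both operations live in the partition "alice", so they can be batched;
# mixing partition keys in one batch raises an error.
orders.submit_transaction([
    ("create", {"PartitionKey": "alice", "RowKey": "order-001", "Total": 20.0}),
    ("create", {"PartitionKey": "alice", "RowKey": "order-002", "Total": 35.5}),
])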
2.4. Black box
If you’ve ever run a database for a service, you’ve probably mucked with its
configuration. The first thing a lot of developers do when setting up
MySQL is dive into my.cnf and
tune the various knobs that are available. Entire books have been
written on tuning indexes, picking storage engines, and optimizing
query plans.
The Azure Table service does not give you individual knobs for
that level of tuning. Since it is a large distributed system, the
service tunes itself automatically based on data, workload, traffic,
and various other factors. The only thing you control is how your
data is partitioned (which will be addressed shortly). This lack of
knobs can be a blessing, since the system takes care of tuning for
you.
2.5. Row size limits
An entity can hold at most 1 MB of data, a limit that includes
the names of your properties. If you are used to sticking large
pieces of data in each row (a questionable practice in itself), you
could hit this limit easily. In such cases, the right thing to do is
to store the data in the blob storage service and keep a pointer to
the blob in the entity, as sketched below. This is similar to storing
large pieces of data in the filesystem, and having the database
maintain a pointer to the specific file.
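Here is what that pointer pattern might look like, pairing the newer
azure-storage-blob and azure-data-tables Python SDKs; the container,
table, and property names are made up:

from azure.core.exceptions import ResourceExistsError
from azure.data.tables import TableServiceClient
from azure.storage.blob import BlobServiceClient

conn = "UseDevelopmentStorage=true"

# Store the large payload in blob storage...
blobs = BlobServiceClient.from_connection_string(conn)
try:
    blobs.create_container("photos")
except ResourceExistsError:
    pass  # container is already there
blob = blobs.get_blob_client(container="photos", blob="photo-42.jpg")
blob.upload_blob(b"...several megabytes of image data...", overwrite=True)

# ...and keep only a small pointer to it in the table entity.
tables = TableServiceClient.from_connection_string(conn)
photos = tables.create_table_if_not_exists("PhotoIndex")
photos.create_entity({
    "PartitionKey": "alice", "RowKey": "photo-42",
    "Caption": "My cat", "BlobUrl": blob.url,
})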
2.6. Lack of support for familiar tools
Like other cloud storage systems, the Azure Table service is
pretty nascent. This means the ecosystem around it is nascent, too.
Tools you’re comfortable with for SQL Server, Oracle, or MySQL mostly
won’t work with Azure tables, and replacements can be hard to find.
This is a problem that will be solved over time as more
people adopt the service.